Use historical draw results and hunter counts to train a model we can use to predict the number of hunters in future years.
TODO - Include other potential inputs that could impact how many hunters get a license and show up, such as economic indicators and the costs associated with hunting (transportation, lodging).
NOTICE that I am only looking at the general rifle hunting seasons on public land. There are also hunters in the Archery, Muzzleloader, Private Land, Ranching for Wildlife, etc. seasons.
Load required libraries for wrangling data, charting, and mapping
library(plyr,quietly = T) # data wrangling
library(dplyr,quietly = T) # data wrangling
library(stringr,quietly = T) # string handling (str_extract below)
library(ggplot2, quietly = T) # charting
library(ggthemes,quietly = T) # so I can add the highcharts theme and palette
library(scales,quietly = T) # to load the percent function when labeling plots
library(caret,quietly = T) # classification and regression training
library(foreach,quietly = T) # parallel processing to speed up the model training
library(doMC,quietly = T) # parallel processing to speed up the model training
library(lubridate,quietly = T) # for timing models
Set our preferred charting theme
theme_set(theme_minimal()+theme_hc()+theme(legend.key.width = unit(1.5, "cm")))
Run script to get hunter data
source('~/_code/colorado-dow/datasets/Colorado Elk Harvest Data.R', echo=F)
Table of the harvest data
head(COElkRifleAll)
Run script to get draw data
source('~/_code/colorado-dow/datasets/Elk Drawing Summaries.R', echo=F)
Table of the data
head(COElkDrawAll)
source geodata
source('~/_code/colorado-dow/datasets/Colorado GMUnit and Road data.R', echo=F)
Take a peek at the boundary data
head(Unitboundaries2)
Set to predictive analytics directory
setwd("~/_code/colorado-dow/phase III - predictive analytics")
We will start by grouping all of the seasons together and modeling the number of hunters per Year and Unit.
Group the draw results data by Year and Unit
COElkDraw <- summarise(group_by(COElkDrawAll,Year,Unit),
Quota = sum(Orig_Quota,na.rm = T),
Drawn = sum(Chcs_Drawn,na.rm = T))
Appropriate field classes for model training
COElkDraw$Year <- as.numeric(COElkDraw$Year)
Group Hunter data by Year and Unit
COElkHunters <- summarise(group_by(COElkRifleAll,Year,Unit),
Hunters = sum(c(Hunters.Antlered,Hunters.Antlerless,Hunters.Either),na.rm = T))
COElkHunters$Year <- as.numeric(COElkHunters$Year)
Join in Hunter and Draw data together
COElkHunters <- left_join(COElkHunters, COElkDraw, by = c("Year","Unit"))
Replace the draw data that don’t have entries with 0
COElkHunters$Drawn[is.na(COElkHunters$Drawn)] <- 0
COElkHunters$Quota[is.na(COElkHunters$Quota)] <- 0
Split into train and test sets. We will train on 75% of the data. Each unit needs to be represented in the split, so do the split within each Unit; first make sure each Unit has at least three entries.
COElkHunters <- mutate(group_by(COElkHunters,Unit),
numentries = n())
COElkHunters <- filter(COElkHunters, numentries >= 3)
COElkHunters$UnitYear <- paste(COElkHunters$Unit, COElkHunters$Year)
traindata <- COElkHunters %>% group_by(Unit) %>% sample_frac(size = .75, replace = F)
testdata <- COElkHunters[!COElkHunters$UnitYear %in% traindata$UnitYear,]
COElkHunters <- select(COElkHunters, -UnitYear, -numentries)
traindata <- select(traindata, -UnitYear, -numentries)
testdata <- select(testdata, -UnitYear, -numentries)
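As a side note, the per-Unit sampling above is what guarantees every Unit is represented in the training set. A base-R sketch on toy data (not the real COElkHunters table) shows the idea:

```r
# Toy illustration of the per-Unit 75/25 split used above (base R only).
# Sampling within each Unit guarantees every Unit lands in the training set.
set.seed(1)
toy <- data.frame(Unit = rep(c("1", "2", "3"), each = 4),
                  Year = rep(2014:2017, times = 3))
train_idx <- unlist(lapply(split(seq_len(nrow(toy)), toy$Unit),
                           function(i) sample(i, size = floor(0.75 * length(i)))))
train <- toy[train_idx, ]
test  <- toy[-train_idx, ]
nrow(train)  # 9 rows: 3 of the 4 years in each Unit
all(unique(toy$Unit) %in% train$Unit)  # TRUE
```

dplyr's sample_frac(size = .75) on the grouped data does the same thing in one step.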
Save off for importing into AzureML
write.csv(COElkHunters,file = "~/_code/colorado-dow/datasets/COElkHunters.csv",row.names = F)
Notice that the number-of-hunters data is skewed.
ggplot(COElkHunters, aes(Hunters)) +
geom_density() +
xlab("Hunters in Unit") +
ylab("Number of Units") +
theme(axis.text.y = element_blank()) +
labs(title="Distribution of Hunters in each Unit", subtitle="2006-2017", caption="source: cpw.state.co.us")
A general rule of thumb: skewed data whose ratio of the highest value to the lowest value is greater than 20 have significant skewness. The skewness statistic can also be used as a diagnostic. If the predictor distribution is roughly symmetric, the skewness value will be close to zero; as the distribution becomes more right skewed, the statistic grows larger, and as it becomes more left skewed, the value goes negative. Replacing the data with the log, square root, or inverse may help to remove the skew.
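A quick way to check both diagnostics from that rule of thumb is to compute them directly. A base-R sketch (one common form of the sample skewness statistic; on the real data you would pass COElkHunters$Hunters):

```r
# One common form of the sample skewness statistic (base R);
# on the real data, call skewness(COElkHunters$Hunters).
skewness <- function(v) {
  v <- v[!is.na(v)]
  mean((v - mean(v))^3) / sd(v)^3
}
# Right-skewed toy vector: a few units with many hunters
x <- c(50, 60, 75, 80, 90, 110, 130, 900)
max(x) / min(x)  # 18: near the >20 rule-of-thumb threshold
skewness(x)      # clearly positive, i.e. right skewed
```

e1071::skewness reports the same statistic (up to the denominator convention).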
Example of how BoxCox can redistribute the data
preProcValues2 <- preProcess(as.data.frame(traindata), method = "BoxCox")
trainBC <- predict(preProcValues2, as.data.frame(traindata))
ggplot(trainBC, aes(Hunters)) +
geom_density() +
xlab("BoxCox Hunters in Unit") +
ylab("Number of Units") +
theme(axis.text.y = element_blank()) +
labs(title="BoxCox Distribution of Hunters in each Unit", subtitle="2006-2017", caption="source: cpw.state.co.us")
caret's preProcess function supports a 'BoxCox' method for correcting skewness; we will need to be sure to consider it when training the models.
This is quite an iterative process, so it is important to document and save off data along the way. The plan:
- Run through several 'quick to train' methods. Which one performed the best?
- Determine other similar methods and run them. Which one performed the best?
- Determine dissimilar methods. Which one performed the best?
- Work on refined tuning: assess preprocessing functions (center, scale, pca, BoxCox, nzv, etc.). Some units have very little data; should we remove them? Units 123, 791, 87, 94, and 88 have zero variance. TODO: frequency plot of Unit.
- Run the list of other favorable methods with preprocessing in place, then further tune each method's parameters.
- Is there a single best method? Are some methods better at part of the data than others? Maybe we combine them?
- Insert into AzureML for further inspection and to create a web service. If the packages or methods are not yet supported in Azure, we will need to create an R Model instead of just running an R script. Additionally, AzureML has some additional methods to consider; be sure to attempt those as well.
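The "maybe we combine?" idea could be prototyped as a simple inverse-RMSE weighted average of two models' predictions. An illustrative sketch with made-up numbers (the pred vectors stand in for predict() output from two trained caret models):

```r
# Illustrative ensemble: weight each model's predictions by inverse RMSE,
# so the lower-error model contributes more.
ensemble_predict <- function(pred_a, pred_b, rmse_a, rmse_b) {
  w_a <- (1 / rmse_a) / ((1 / rmse_a) + (1 / rmse_b))
  w_a * pred_a + (1 - w_a) * pred_b
}
# Made-up predictions for three units, weighted by hypothetical RMSEs
ensemble_predict(c(100, 250, 400), c(120, 240, 380),
                 rmse_a = 130, rmse_b = 154)
```

Each blended prediction falls between the two models' predictions, pulled toward the lower-RMSE model; whether this beats the single best model would need to be verified on the test set like any other method.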
Step 1 - Loop through possible methods, utilizing the quicker 'adaptive_cv' parameter search from caret. (Consider scripting this in AzureML to make it run much faster, though there is more setup and there are more errors to control for.)
quickmethods <- c("lm",'svmLinear',"svmRadial","knn","cubist","kknn","glm.nb")
step1_all <- NULL
for (imethod in quickmethods) {
step1 <- NULL
start <- now()
if (imethod == "lm") {
controlmethod <- "repeatedcv"
} else {controlmethod <- "adaptive_cv"}
fitControl <- trainControl(
method = controlmethod,
# search = 'random',
number = 4,
repeats = 4,
allowParallel = TRUE,
summaryFunction = defaultSummary)
registerDoSEQ()
registerDoMC(cores = 6)
HuntersModel_1 = train(Hunters ~ ., data = traindata,
method = imethod,
preProc = c("center","scale"),
tuneLength = 15,
trControl = fitControl)
HuntersModel_1
# measure performance
predictdata <- predict(HuntersModel_1, testdata)
step1$method <- imethod
step1$RMSE <- postResample(pred = predictdata, obs = testdata$Hunters)[1]
step1$duration <- now() - start
step1 <- as.data.frame(step1)
step1_all <- rbind(step1_all,step1)
}
View Results
step1_all
top_two_models <- top_n(step1_all,2,-RMSE)$method
Take the top two and determine some additional methods to try by maximizing the Jaccard dissimilarity between sets of models
tag <- read.csv("tag_data.csv", row.names = 1)
tag <- as.matrix(tag)
regModels <- tag[tag[,"Regression"] == 1,]
all <- 1:nrow(regModels)
dissimilarmethods_all <- NULL
for (itoptwo in 1:2) {
## Seed the analysis with the model of interest
start <- grep(top_two_models[itoptwo], rownames(regModels), fixed = TRUE)
pool <- all[all != start]
## Select 4 more models by maximizing the Jaccard
## dissimilarity between sets of models
nextMods <- maxDissim(regModels[start,,drop = FALSE],
regModels[pool, ],
method = "Jaccard",
n = 4)
rownames(regModels)[c(nextMods)]
dissimilarmethods <- rownames(regModels)[nextMods]
dissimilarmethods <- str_extract(string = dissimilarmethods,pattern = "[:alnum:]+(?=\\))")
dissimilarmethods_all <- c(dissimilarmethods_all,dissimilarmethods)
}
Now we have up to eight more methods to try in the same manner
dissimilarmethods_all <- unique(dissimilarmethods_all)
for (imethod in dissimilarmethods_all) {
step1 <- NULL
start_timer <- now()[1]
if (imethod == "lm") {
controlmethod <- "repeatedcv"
} else {controlmethod <- "adaptive_cv"}
fitControl <- trainControl(
method = controlmethod,
# search = 'random',
number = 4,
repeats = 4,
allowParallel = TRUE,
summaryFunction = defaultSummary)
registerDoSEQ()
registerDoMC(cores = 6)
HuntersModel_1 = train(Hunters ~ ., data = traindata,
method = imethod,
preProc = c("center","scale"),
tuneLength = 15,
trControl = fitControl)
HuntersModel_1
# measure performance
predictdata <- predict(HuntersModel_1, testdata)
step1$method <- imethod
step1$RMSE <- postResample(pred = predictdata, obs = testdata$Hunters)[1]
step1$duration <- now()[1] - start_timer[1]
step1 <- as.data.frame(step1)
step1_all <- rbind(step1_all,step1)
}
step1_all
method RMSE duration
RMSE lm 165.0025 6.866598 secs
RMSE1 svmLinear 178.0942 4.950974 secs
RMSE2 svmRadial 134.0561 22.081016 secs
RMSE3 knn 733.8210 10.502875 secs
RMSE4 cubist 151.9572 3.057477 secs
RMSE5 kknn 132.0211 9.778734 secs
RMSE6 cubist 160.5464 1.027274 secs
Now let's work on some refined tuning of the top methods. Are there any valuable preprocessing steps?
preprocessfunctions <- c("BoxCox", "YeoJohnson", "expoTrans", "center", "scale", "range", "knnImpute", "bagImpute", "medianImpute", "pca", "ica", "spatialSign", "corr", "zv", "nzv")
topmethods <- top_n(step1_all,2,-RMSE)$method
fitControl <- trainControl(
method = "adaptive_cv", #repeatedcv,
search = 'random',
number = 10, #4
repeats = 10, #10
# classProbs = TRUE,
# savePred = TRUE,
allowParallel = TRUE,
summaryFunction = defaultSummary)
PPperformance_all <- NULL
PPperformance <- NULL
for (imethod in topmethods) {
for (ipreprocess in preprocessfunctions) {
registerDoSEQ()
registerDoMC(cores = 6)
PreProcessModel = train(Hunters ~ ., data = traindata,
method = imethod,
preProc = ipreprocess,
#tuneLength = 10,
#tuneGrid = kknnTuneGrid,
trControl = fitControl)
print(PreProcessModel)
# check performance
predictdata <- predict(PreProcessModel, testdata)
PPperformance$method <- imethod
PPperformance$preprocess <- ipreprocess
PPperformance$RMSE <- postResample(pred = predictdata, obs = testdata$Hunters)[1]
PPperformance <- as.data.frame(PPperformance)
PPperformance_all <- rbind(PPperformance_all,PPperformance)
}
}
PPperformance_all
Error: object 'PPperformance_all' not found
Output from AzureML:

       method    preprocess     RMSE
RMSE   kknn      BoxCox     130.7939
RMSE1  kknn      YeoJohnson 130.9600
RMSE2  kknn      center     130.7331
RMSE3  kknn      scale      130.1818
RMSE4  kknn      pca        130.2071
RMSE5  svmRadial BoxCox     154.0898
RMSE6  svmRadial YeoJohnson 169.9816
RMSE7  svmRadial center     154.1891
RMSE8  svmRadial scale      154.1000
RMSE9  svmRadial pca        164.0881

svmRadial and kknn don't perform better with any of the preprocessing functions in place.
Now we can review the predictors. There are only a few fields, so I will manually test performance while excluding each of them to gauge their importance. Some of our fields are intuitively required (Year, Unit).
Predictors <- c("Quota","Drawn")
Predictorperformance_all <- NULL
Predictorperformance <- NULL
for (imethod in topmethods) {
for (ipredictor in Predictors) {
registerDoSEQ()
registerDoMC(cores = 6)
PredictorModel = train(Hunters ~ ., data = select(traindata,-ipredictor),
method = imethod,
tuneLength = 15,
trControl = fitControl)
print(PredictorModel)
# check performance
predictdata <- predict(PredictorModel, testdata)
Predictorperformance$method <- imethod
Predictorperformance$missing_predictor <- ipredictor
Predictorperformance$RMSE <- postResample(pred = predictdata, obs = testdata$Hunters)[1]
Predictorperformance <- as.data.frame(Predictorperformance)
Predictorperformance_all <- rbind(Predictorperformance_all,Predictorperformance)
}
}
Predictorperformance_all
method missing_predictor RMSE
RMSE svmRadial Quota 152.5027
RMSE1 svmRadial Drawn 153.7358
RMSE2 kknn Quota 131.1298
RMSE3 kknn Drawn 134.1082
RMSE4 kknn Quota and Drawn 130.3965
svmRadial performs better with all of the predictors, while kknn performs better with only the Unit and Year fields
Use the above information to test out various combinations of preprocessing and predictor sets
kknn without Quota and Drawn
fitControl <- trainControl(
method = "adaptive_cv", #repeatedcv,
search = 'random',
number = 10, #4
repeats = 10, #10
# classProbs = TRUE,
# savePred = TRUE,
allowParallel = TRUE,
summaryFunction = defaultSummary)
registerDoSEQ()
registerDoMC(cores = 6)
kknnModel = train(Hunters ~ ., data = select(COElkHunters,-Quota, -Drawn),
method = "kknn",
tuneLength = 75,
trControl = fitControl)
kknnModel
k-Nearest Neighbors
1540 samples
2 predictor
No pre-processing
Resampling: Adaptively Cross-Validated (10 fold, repeated 10 times)
Summary of sample sizes: 1387, 1387, 1388, 1386, 1385, 1385, ...
Resampling results across tuning parameters:
kmax distance kernel RMSE Rsquared MAE Resamples
20 2.2190668 epanechnikov 131.6756 0.9799921 83.10401 5
23 2.5546249 cos 131.4289 0.9800702 82.94570 5
26 1.2077079 triweight 130.1354 0.9802204 82.38677 100
35 1.4783952 inv 130.6411 0.9803881 82.92012 5
37 0.2302676 biweight 128.8218 0.9796468 82.61469 9
39 2.9238407 epanechnikov 131.6756 0.9799921 83.10401 5
40 1.9249761 epanechnikov 131.6756 0.9799921 83.10400 5
45 0.5708073 cos 131.4271 0.9800698 82.95231 5
46 1.5130392 rectangular 132.7737 0.9797456 84.93577 5
47 1.5066909 inv 130.6411 0.9803881 82.92012 5
51 1.6936498 triangular 131.0532 0.9801930 82.70696 5
59 0.4340342 inv 130.6411 0.9803881 82.92012 5
64 0.8707815 biweight 128.8075 0.9796519 82.54501 9
69 0.4283565 rectangular 132.7737 0.9797456 84.93577 5
72 0.1078861 gaussian 131.8750 0.9800032 84.17641 5
72 1.8257081 rectangular 132.7737 0.9797456 84.93577 5
94 1.5894162 triweight 128.1327 0.9814758 80.97408 6
95 2.2151232 epanechnikov 131.6756 0.9799921 83.10401 5
100 0.3817441 gaussian 131.8701 0.9800047 84.16123 5
108 1.2081158 cos 131.4288 0.9800702 82.94535 5
111 2.4803098 inv 130.6411 0.9803881 82.92012 5
118 0.5056196 biweight 128.8059 0.9796519 82.54077 9
122 1.0827835 biweight 128.4338 0.9814240 81.35076 6
126 0.9625002 triangular 131.0528 0.9801930 82.70613 5
130 2.8021649 triangular 131.0532 0.9801930 82.70698 5
151 1.2865492 triweight 130.1354 0.9802204 82.38677 100
155 0.3185608 triangular 131.0572 0.9801902 82.73030 5
158 2.9508586 rectangular 132.7737 0.9797456 84.93577 5
164 0.3406626 biweight 128.8112 0.9796498 82.56733 9
175 1.8822426 gaussian 131.8656 0.9800062 84.14435 5
180 0.7829322 inv 130.6411 0.9803881 82.92012 5
183 2.8578463 triweight 128.1327 0.9814758 80.97408 6
196 0.9738625 biweight 128.8076 0.9796519 82.54521 9
197 1.9103930 triangular 131.0532 0.9801930 82.70697 5
208 1.7948344 rectangular 132.7737 0.9797456 84.93577 5
209 2.8788390 epanechnikov 131.6756 0.9799921 83.10401 5
216 0.1616174 cos 131.4495 0.9800625 83.03523 5
219 1.5906034 triweight 128.1327 0.9814758 80.97408 6
224 2.5172216 triangular 131.0532 0.9801930 82.70698 5
226 2.4874363 inv 131.6418 0.9800365 82.90705 5
232 0.4289729 triangular 131.0522 0.9801921 82.71857 5
232 1.1267958 rectangular 132.7737 0.9797456 84.93577 5
239 0.2772930 biweight 128.4484 0.9814182 81.42926 6
244 0.7280230 triweight 128.1326 0.9814758 80.97385 6
247 0.3054103 gaussian 131.8722 0.9800041 84.16786 5
257 0.4641430 inv 130.6411 0.9803881 82.92012 5
313 1.5046697 triweight 130.1354 0.9802204 82.38677 100
317 1.6199505 triangular 131.0531 0.9801930 82.70695 5
319 1.6611979 triweight 130.1354 0.9802204 82.38677 100
323 2.2840918 cos 131.4289 0.9800702 82.94570 5
327 2.4589409 gaussian 131.8656 0.9800062 84.14434 5
338 0.5952266 inv 130.6411 0.9803881 82.92012 5
341 0.8479846 biweight 128.4336 0.9814240 81.35023 6
345 2.3194921 cos 131.4289 0.9800702 82.94570 5
348 1.6925820 cos 131.4289 0.9800702 82.94567 5
371 2.5390863 rectangular 132.7737 0.9797456 84.93577 5
377 0.1170220 gaussian 131.8750 0.9800032 84.17640 5
380 0.6688485 triweight 130.1354 0.9802204 82.38664 100
386 2.5533768 gaussian 131.8656 0.9800062 84.14434 5
390 0.7606599 inv 130.6411 0.9803881 82.92012 5
391 1.4796130 gaussian 131.8656 0.9800062 84.14437 5
396 2.4921411 triangular 131.0532 0.9801930 82.70698 5
402 1.9321351 cos 131.4289 0.9800702 82.94569 5
429 2.2906721 epanechnikov 131.6756 0.9799921 83.10401 5
432 2.4669464 epanechnikov 131.6756 0.9799921 83.10401 5
435 1.9112202 rectangular 132.7737 0.9797456 84.93577 5
460 1.0723682 cos 131.4286 0.9800702 82.94498 5
460 1.4085408 triweight 128.1327 0.9814758 80.97408 6
465 0.1073528 triweight 127.8562 0.9801797 82.07937 10
469 0.9612949 epanechnikov 131.6750 0.9799922 83.10237 5
476 0.6659399 triangular 131.0518 0.9801930 82.70557 5
477 1.1059912 gaussian 131.8657 0.9800062 84.14460 5
477 2.3521826 inv 130.6411 0.9803881 82.92012 5
500 0.7270984 rectangular 132.7737 0.9797456 84.93577 5
508 0.3723910 biweight 128.8089 0.9796506 82.55357 9
RMSE was used to select the optimal model using the smallest value.
The final values used for the model were kmax = 380, distance = 0.6688485 and kernel = triweight.
That is not the lowest RMSE in the table. caret is likely selecting among the settings that survived the adaptive resampling (note the Resamples column: rows evaluated on only a handful of resamples have noisy RMSE estimates), so let's select the minimum manually
RSMEkknn <- filter(kknnModel$results, RMSE == min(RMSE))
RSMEkknn$kernel <- as.character(RSMEkknn$kernel)
Run again with a tuning grid centered on those values
kknnTuneGrid <- data.frame(kmax = rep(RSMEkknn$kmax, 5),
distance = RSMEkknn$distance * c(.7, .9, 1, 1.1, 1.3),
kernel = rep(RSMEkknn$kernel, 5))
fitControl <- trainControl(
method = "repeatedcv", #repeatedcv,
number = 10, #4
repeats = 10, #10
allowParallel = TRUE,
summaryFunction = defaultSummary)
registerDoSEQ()
registerDoMC(cores = 6)
kknnGridModel = train(Hunters ~ ., data = select(COElkHunters,-Quota, -Drawn),
method = "kknn",
tuneGrid = kknnTuneGrid,
trControl = fitControl)
kknnGridModel
k-Nearest Neighbors
1540 samples
2 predictor
No pre-processing
Resampling: Cross-Validated (10 fold, repeated 10 times)
Summary of sample sizes: 1386, 1388, 1388, 1384, 1388, 1385, ...
Resampling results across tuning parameters:
distance RMSE Rsquared MAE
0.07514696 128.2060 0.9808475 81.25859
0.09661752 128.2060 0.9808476 81.25851
0.10735280 128.2060 0.9808476 81.25846
0.11808808 128.2059 0.9808476 81.25838
0.13955864 128.2058 0.9808476 81.25803
Tuning parameter 'kmax' was held constant at a value of 465
Tuning parameter 'kernel' was held constant at a value
of triweight
RMSE was used to select the optimal model using the smallest value.
The final values used for the model were kmax = 465, distance = 0.1395586 and kernel = triweight.
Select the best-RMSE parameters for the next grid
RSMEkknn <- filter(kknnGridModel$results, RMSE == min(RMSE))
RSMEkknn$kernel <- as.character(RSMEkknn$kernel)
Run again with a tuning grid, this time varying kmax
kknnTuneGrid2 <- data.frame(kmax = RSMEkknn$kmax * c(.7, .9, 1, 1.1, 1.3),
distance = rep(RSMEkknn$distance, 5),
kernel = rep(RSMEkknn$kernel, 5))
registerDoSEQ()
registerDoMC(cores = 6)
kknnGridModel2 = train(Hunters ~ ., data = select(COElkHunters,-Quota, -Drawn),
method = "kknn",
tuneGrid = kknnTuneGrid2,
trControl = fitControl)
kknnGridModel2
k-Nearest Neighbors
1540 samples
2 predictor
No pre-processing
Resampling: Cross-Validated (10 fold, repeated 10 times)
Summary of sample sizes: 1387, 1385, 1386, 1386, 1386, 1387, ...
Resampling results across tuning parameters:
kmax RMSE Rsquared MAE
325.5 129.3056 0.9804359 81.89613
418.5 129.3056 0.9804359 81.89613
465.0 129.3056 0.9804359 81.89613
511.5 129.3056 0.9804359 81.89613
604.5 129.3056 0.9804359 81.89613
Tuning parameter 'distance' was held constant at a value of 0.1395586
Tuning parameter 'kernel' was held constant at
a value of triweight
RMSE was used to select the optimal model using the smallest value.
The final values used for the model were kmax = 604.5, distance = 0.1395586 and kernel = triweight.
One more time on the final parameter (kernel); again select the best RMSE manually rather than relying on caret's pick
RSMEkknn <- filter(kknnGridModel2$results, RMSE == min(RMSE))[1,]
kernels <- levels(kknnModel$results$kernel)
run again with a tune grid
kknnTuneGrid3 <- data.frame(kmax = rep(465.0,8),
distance = rep(0.1395586,8),
kernel = kernels)
registerDoSEQ()
registerDoMC(cores = 6)
kknnGridModel3 = train(Hunters ~ ., data = select(COElkHunters,-Quota, -Drawn),
method = "kknn",
tuneGrid = kknnTuneGrid3,
trControl = fitControl)
kknnGridModel3
k-Nearest Neighbors
1540 samples
2 predictor
No pre-processing
Resampling: Cross-Validated (10 fold, repeated 10 times)
Summary of sample sizes: 1385, 1387, 1386, 1386, 1385, 1385, ...
Resampling results across tuning parameters:
kernel RMSE Rsquared MAE
biweight 131.7906 0.9797220 83.06975
cos 132.5038 0.9795046 83.42612
epanechnikov 132.3542 0.9795478 83.44426
gaussian 132.2517 0.9795637 83.47951
inv 131.4340 0.9798233 82.78442
rectangular 133.2426 0.9792578 84.07603
triangular 132.3594 0.9795341 83.35722
triweight 130.7170 0.9800337 82.49768
Tuning parameter 'kmax' was held constant at a value of 465
Tuning parameter 'distance' was held constant at a value
of 0.1395586
RMSE was used to select the optimal model using the smallest value.
The final values used for the model were kmax = 465, distance = 0.1395586 and kernel = triweight.
RSMEkknn <- filter(kknnGridModel3$results, RMSE == min(RMSE))
Best RMSE for kknn thus far
RSMEkknn <- filter(kknnModel$results, RMSE == min(RMSE))
Work through some resampling methods with the best kknn parameters
kknnTuneGrid4 <- data.frame(kmax = RSMEkknn$kmax,
distance = RSMEkknn$distance,
kernel = as.character(RSMEkknn$kernel))
trainmethods <- c("boot", "boot632", "optimism_boot", "boot_all", "cv", "repeatedcv", "LOOCV", "LGOCV", "none")
trainmethodperformance_all <- NULL
for (itrainmethod in trainmethods) {
trainmethodperformance <- NULL
fitControl <- trainControl(
method = itrainmethod,
number = 10, #4
repeats = 10, #10
allowParallel = TRUE,
summaryFunction = defaultSummary)
registerDoSEQ()
registerDoMC(cores = 6)
kknnTrainModel = train(Hunters ~ ., data = select(COElkHunters,-Quota, -Drawn),
method = "kknn",
tuneGrid = kknnTuneGrid4,
trControl = fitControl)
print(kknnTrainModel)
trainmethodperformance <- filter(kknnTrainModel$results, RMSE == min(RMSE))
trainmethodperformance$trainmethod <- itrainmethod
trainmethodperformance_all <- rbind.fill(trainmethodperformance_all,trainmethodperformance)
}
trainmethodperformance_all
sigma C RMSE Rsquared MAE RMSESD RsquaredSD MAESD trainmethod RMSEApparent
1 0.0037653 2048 360.4005 0.7726028 231.54142 282.30458 0.298028249 191.167393 boot NA
2 0.0037653 2048 193.7360 0.9278034 123.00804 200.64763 0.209256870 132.918016 boot632 94.5999
3 0.0037653 2048 131.0365 0.9805517 83.53679 19.64063 0.008462524 7.577739 optimism_boot 94.5999
4 0.0037653 2048 135.4261 0.9789121 92.45378 11.02942 0.003095162 5.487161 cv NA
5 0.0037653 2048 141.6488 0.9759367 94.08007 32.98579 0.017374430 9.333094 repeatedcv NA
6 0.0037653 2048 133.2932 0.9794390 90.18903 NA NA NA LOOCV NA
7 0.0037653 2048 150.4885 0.9732360 99.97503 13.06248 0.004776229 6.047319 LGOCV NA
RsquaredApparent MAEApparent RMSEOptimism RsquaredOptimism MAEOptimism
1 NA NA NA NA NA
2 0.9898111 69.14552 NA NA NA
3 0.9898111 69.14552 36.43665 -0.009259344 14.39127
4 NA NA NA NA NA
5 NA NA NA NA NA
6 NA NA NA NA NA
7 NA NA NA NA NA
fitControl <- trainControl(
method = "optimism_boot",
number = 10, #4
allowParallel = TRUE,
summaryFunction = defaultSummary)
kknnFinalTrainModel = train(Hunters ~ ., data = COElkHunters,
method = "kknn",
tuneGrid = kknnTuneGrid4,
trControl = fitControl)
save off for future loading
save(kknnFinalTrainModel, file = "~/_code/colorado-dow/datasets/kknnFinalTrainModel.RData")
Back to the train vs. test data for one more performance measure and chart, even though for future data we will use the final trained model
kknnTrainModel = train(Hunters ~ ., data = traindata,
method = "kknn",
tuneGrid = kknnTuneGrid4,
trControl = fitControl)
check performance
predictdata <- predict(kknnTrainModel, testdata)
postResample(pred = predictdata, obs = testdata$Hunters)
Chart performance of predicted
chartperformance <- data.frame(predicted = predictdata, observed = testdata$Hunters)
ggplot(chartperformance, aes(predicted,observed)) +
geom_point() +
labs(title="Performance of Number of Hunters Prediction", caption="source: cpw.state.co.us")
Output from AzureML:

Support Vector Machines with Radial Basis Function Kernel

1540 samples
4 predictors

No pre-processing
Resampling: Cross-Validated (10 fold, repeated 10 times)
Summary of sample sizes: 1386, 1386, 1387, 1387, 1385, 1385, ...
Resampling results across tuning parameters:

  C        RMSE  Rsquared  RMSE SD  Rsquared SD
  0.25      276  0.946        30.6  0.00822
  0.5       203  0.96         21.4  0.00754
  1         180  0.965        17.7  0.00671
  2         168  0.969        15.7  0.00614
  4         158  0.972        14.9  0.0055
  8         150  0.975        14.7  0.00506
  16        146  0.976        14.7  0.00481
  32        144  0.976        14.7  0.00477
  64        143  0.977        14.6  0.00474
  128       140  0.977        14.3  0.00469
  256       139  0.978        15    0.00485
  512       137  0.978        14.9  0.00484
  1020      136  0.979        15.3  0.00494
  2050      135  0.979        15.6  0.00513
  4100      136  0.979        15.5  0.0051
  8190      137  0.978        15.7  0.00518
  16400     139  0.978        16.5  0.00551
  32800     141  0.977        17.6  0.006
  65500     145  0.976        19.4  0.00659
  131000    151  0.974        20.8  0.00718
  262000    161  0.97         27.2  0.0104
  524000    478  0.8         325    0.172
  1050000  1180  0.5        1010    0.204
  2100000  3240  0.148      2260    0.117
  4190000  6000  0.0604     6400    0.0527

Tuning parameter 'sigma' was held constant at a value of 0.0037653
RMSE was used to select the optimal model using the smallest value.
The final values used for the model were sigma = 0.00377 and C = 2048.
svmRadTuneGrid <- data.frame(.sigma = rep(0.0037653, 5),
.C = 2048 * c(.7, .9, 1, 1.1, 1.3))
fitControl <- trainControl(
method = "repeatedcv", #repeatedcv,
number = 10, #4
repeats = 10, #10
allowParallel = TRUE,
summaryFunction = defaultSummary)
registerDoSEQ()
registerDoMC(cores = 6)
svmRadGridModel = train(Hunters ~ ., data = COElkHunters,
method = "svmRadial",
tuneGrid = svmRadTuneGrid,
trControl = fitControl)
svmRadGridModel
Support Vector Machines with Radial Basis Function Kernel
1540 samples
4 predictor
No pre-processing
Resampling: Cross-Validated (10 fold, repeated 10 times)
Summary of sample sizes: 1385, 1385, 1385, 1386, 1386, 1386, ...
Resampling results across tuning parameters:
C RMSE Rsquared MAE
1433.6 139.2026 0.9774819 94.04650
1843.2 139.0134 0.9775285 93.81404
2048.0 139.0110 0.9775067 93.69544
2252.8 139.0443 0.9774694 93.63952
2662.4 139.1025 0.9774175 93.69682
Tuning parameter 'sigma' was held constant at a value of 0.0037653
RMSE was used to select the optimal model using the smallest value.
The final values used for the model were sigma = 0.0037653 and C = 2048.
Select the best-RMSE parameters for the next grid
RSMEsvmRad <- filter(svmRadGridModel$results, RMSE == min(RMSE))
Run again with a tuning grid, this time varying sigma
svmRadTuneGrid2 <- data.frame(.sigma = RSMEsvmRad$sigma * c(.7, .9, 1, 1.1, 1.3),
.C = rep(RSMEsvmRad$C, 5))
registerDoSEQ()
registerDoMC(cores = 6)
svmRadGridModel2 = train(Hunters ~ ., data = COElkHunters,
method = "svmRadial",
tuneGrid = svmRadTuneGrid2,
trControl = fitControl)
svmRadGridModel2
Support Vector Machines with Radial Basis Function Kernel
1540 samples
4 predictor
No pre-processing
Resampling: Cross-Validated (10 fold, repeated 10 times)
Summary of sample sizes: 1385, 1386, 1386, 1387, 1385, 1386, ...
Resampling results across tuning parameters:
sigma RMSE Rsquared MAE
0.00263571 137.7150 0.9779063 92.80793
0.00338877 137.2732 0.9780733 92.75156
0.00376530 137.2189 0.9780935 92.77121
0.00414183 137.5238 0.9780036 93.07738
0.00489489 138.4695 0.9777092 94.14245
Tuning parameter 'C' was held constant at a value of 2048
RMSE was used to select the optimal model using the smallest value.
The final values used for the model were sigma = 0.0037653 and C = 2048.
RSMEsvmRad <- filter(svmRadGridModel2$results, RMSE == min(RMSE))
Work through some resampling methods with the best svmRadial parameters
svmRadTuneGrid3 <- data.frame(.sigma = RSMEsvmRad$sigma,
.C = RSMEsvmRad$C)
trainmethods <- c("boot", "boot632", "optimism_boot", "cv", "repeatedcv", "LOOCV", "LGOCV", "none")
trainmethodperformance_all <- NULL
for (itrainmethod in trainmethods) {
trainmethodperformance <- NULL
fitControl <- trainControl(
method = itrainmethod,
number = 10, #4
repeats = 10, #10
allowParallel = TRUE,
summaryFunction = defaultSummary)
registerDoSEQ()
registerDoMC(cores = 6)
svmRadTrainModel = train(Hunters ~ ., data = COElkHunters,
method = "svmRadial",
tuneGrid = svmRadTuneGrid3,
trControl = fitControl)
print(svmRadTrainModel)
trainmethodperformance <- filter(svmRadTrainModel$results, RMSE == min(RMSE))
trainmethodperformance$trainmethod <- itrainmethod
trainmethodperformance_all <- rbind.fill(trainmethodperformance_all,trainmethodperformance)
}
trainmethodperformance_all
sigma C RMSE Rsquared MAE RMSESD RsquaredSD MAESD trainmethod RMSEApparent
1 0.0037653 2048 360.4005 0.7726028 231.54142 282.30458 0.298028249 191.167393 boot NA
2 0.0037653 2048 193.7360 0.9278034 123.00804 200.64763 0.209256870 132.918016 boot632 94.5999
3 0.0037653 2048 131.0365 0.9805517 83.53679 19.64063 0.008462524 7.577739 optimism_boot 94.5999
4 0.0037653 2048 135.4261 0.9789121 92.45378 11.02942 0.003095162 5.487161 cv NA
5 0.0037653 2048 141.6488 0.9759367 94.08007 32.98579 0.017374430 9.333094 repeatedcv NA
6 0.0037653 2048 133.2932 0.9794390 90.18903 NA NA NA LOOCV NA
7 0.0037653 2048 150.4885 0.9732360 99.97503 13.06248 0.004776229 6.047319 LGOCV NA
RsquaredApparent MAEApparent RMSEOptimism RsquaredOptimism MAEOptimism
1 NA NA NA NA NA
2 0.9898111 69.14552 NA NA NA
3 0.9898111 69.14552 36.43665 -0.009259344 14.39127
4 NA NA NA NA NA
5 NA NA NA NA NA
6 NA NA NA NA NA
7 NA NA NA NA NA
fitControl <- trainControl(
method = "optimism_boot",
number = 10, #4
allowParallel = TRUE,
summaryFunction = defaultSummary)
svmRadFinalTrainModel = train(Hunters ~ ., data = COElkHunters,
method = "svmRadial",
tuneGrid = svmRadTuneGrid3,
trControl = fitControl)
save off for future loading
save(svmRadFinalTrainModel, file = "~/_code/colorado-dow/datasets/svmRadFinalTrainModel.RData")
Back to the train vs. test data for one more performance measure and chart, even though for future data we will use the final trained model
svmRadTrainModel = train(Hunters ~ ., data = traindata,
method = "svmRadial",
tuneGrid = svmRadTuneGrid3,
trControl = fitControl)
check performance
predictdata <- predict(svmRadTrainModel, testdata)
postResample(pred = predictdata, obs = testdata$Hunters)
Chart performance of predicted
chartperformance <- data.frame(predicted = predictdata, observed = testdata$Hunters)
ggplot(chartperformance, aes(predicted,observed)) +
geom_point() +
labs(title="Performance of Number of Hunters Prediction", caption="source: cpw.state.co.us")
kknn performed better than svmRadial (RMSE = 130 vs. 154), so keep the kknn model as the final Hunters model
FinalHuntersmodel <- kknnFinalTrainModel
# FinalHuntersmodel <- svmRadFinalTrainModel
save(FinalHuntersmodel, file = "~/_code/colorado-dow/datasets/FinalHuntersmodel.RData")
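Since the final model is persisted with save(), later sessions can restore it with load() instead of retraining. A small self-contained sketch of that round trip (toy object and a temp file standing in for the .RData path above):

```r
# load() restores the object under its original name; no reassignment is needed
obj <- list(model = "kknn", rmse = 130)
tmpfile <- tempfile(fileext = ".RData")
save(obj, file = tmpfile)
rm(obj)
load(tmpfile)  # obj exists again in the workspace
obj$model      # "kknn"
```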
Use the 2018 draw data to predict the number of hunters in 2018
COElkDraw2018 <- filter(COElkDraw, Year == 2018)
COElkHunters2018 <- COElkDraw2018[, colnames(COElkDraw2018) %in% c("Unit",FinalHuntersmodel$coefnames)]
COElkHunters2018 <- as.data.frame(unique(COElkHunters$Unit))
colnames(COElkHunters2018) <- "Unit"
COElkHunters2018$Year <- 2018
COElkHunters2018 <- left_join(COElkHunters2018,filter(COElkDraw,Year==2018))
Replace the draw data that don't have entries with 0
COElkHunters2018$Drawn[is.na(COElkHunters2018$Drawn)] <- 0
COElkHunters2018$Quota[is.na(COElkHunters2018$Quota)] <- 0
COElkHunters2018 <- COElkHunters2018[, colnames(COElkHunters2018) %in% c("Unit",FinalHuntersmodel$coefnames)]
COElkHunters2018$Hunters <- round(predict(FinalHuntersmodel, COElkHunters2018))
COElkHunters2018$Hunters[COElkHunters2018$Hunters<0] <- 0
Save off so we don't have to recreate the model every time we want the results
save(COElkHunters2018,file="COElkHunters2018.RData")
COElkHuntersStatewide <- summarise(group_by(COElkRifleAll,Year,Unit),
                                   Hunters = sum(c(Hunters.Antlered,Hunters.Antlerless,Hunters.Either),na.rm = T))
COElkHunters2018b <- COElkHunters2018
# COElkHunters2018b$Year <- as.character(COElkHunters2018b$Year)
Join 2018 to the historic data
COElkHuntersAll <- rbind.fill(COElkHuntersStatewide,COElkHunters2018b)
COElkHuntersStatewide <- summarise(group_by(COElkHuntersAll,Year), Hunters = sum(Hunters))
ggplot(COElkHuntersStatewide, aes(Year,Hunters)) +
  geom_bar(stat="identity",fill=ggthemes_data$hc$palettes$default[2]) +
  coord_cartesian(ylim = c(130000,155000)) +
  labs(title="Statewide Elk Hunters", caption="source: cpw.state.co.us")
ggplot(filter(COElkHuntersAll,Unit == "77"), aes(Year,Hunters)) +
  geom_point() +
  # geom_line() +
  # scale_x_date() +
  # geom_bar(stat="identity",fill=ggthemes_data$hc$palettes$default[2]) +
  # coord_cartesian(ylim = c(110000,160000)) +
  labs(title="Unit 77 Elk Hunters", caption="source: cpw.state.co.us")
TODO commentary
I’d like to know where the hunters are distributed across the state.
Next year's data
Year2018 <- filter(COElkHuntersAll, Year == "2018")
HunterstoPlot <- left_join(Unitboundaries2,Year2018, by=c("Unit"))
ggplot(HunterstoPlot, aes(long, lat, group = group)) +
  geom_polygon(aes(fill = Hunters),colour = "grey50", size = .2) + # Unit boundaries
  geom_path(data = COroads,aes(x = long, y = lat, group = group), color="#3878C7",size=2) + # Roads
  geom_text(data=data_centroids,aes(x=longitude,y=latitude,label = Unit),size=3) + # Unit labels
  scale_fill_distiller(palette = "Oranges",direction = 1,na.value = 'light grey') +
  xlab("") + ylab("") +
  labs(title="Predicted 2018 Colorado Elk Hunters", caption="source: cpw.state.co.us")
TODO - commentary
Create a png of each year
icounter <- 0
for (imap in unique(COElkHuntersAll$Year)){
  # Colorado aspect ratio = 1087w x 800h -> 1.35875
  # Use trial and error to determine which width and height to define for png files that will retain the correct aspect ratio
  png(file=paste("HuntersMap",imap,".png"), width=948, height=700)
  yearplot <- filter(COElkHuntersAll, Year == imap)
  HunterstoPlot <- left_join(Unitboundaries2,yearplot, by=c("Unit"))
  p1 <- ggplot(HunterstoPlot, aes(long, lat, group = group)) +
    geom_polygon(aes(fill = Hunters),colour = "grey50", size = .2) + # Unit boundaries
    geom_path(data = COroads,aes(x = long, y = lat, group = group), color="#3878C7",size=2) + # Roads
    geom_text(data=data_centroids,aes(x=longitude,y=latitude,label = Unit),size=5) + # Unit labels
    scale_fill_distiller(palette = "Oranges", direction = 1, na.value = 'light grey',
                         limits = c(0,max(COElkHuntersAll$Hunters))) + # fix so each year's chart has the same color breaks
    xlab("") + ylab("") +
    theme(plot.title=element_text(hjust = .5)) +
    theme(plot.subtitle=element_text(hjust = icounter/length(unique(COElkHuntersAll$Year)))) +
    labs(title="Colorado Elk Hunters", subtitle=imap, caption="source: cpw.state.co.us")
  plot(p1)
  dev.off()
  icounter <- icounter + 1
}
Convert the .png files to one .gif file using ImageMagick.
system("convert -delay 150 *.png HuntersmapPred.gif")
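If the ImageMagick `convert` binary isn't on the system path, the same animation can be built from within R using the magick package. A sketch under that assumption (the file pattern matches the png() calls above; delay is in 1/100ths of a second, mirroring `-delay 150`):

```r
library(magick)  # assumes the magick package is installed
pngs   <- sort(list.files(pattern = "^HuntersMap.*\\.png$"))
frames <- image_read(pngs)
animation <- image_animate(image_join(frames), delay = 150)
image_write(animation, "HuntersmapPred.gif")
```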
TODO - commentary
Remove the .png files
file.remove(list.files(pattern=".png"))
***
### Number of Hunters Rank of the Units
It would also be beneficial to rank each unit so I can reference the ranking later. In this case, average the number of hunters over the last few years.
HunterRank2018 <- filter(COElkHuntersAll, as.numeric(Year) == 2018)
HunterRank2018 <- summarise(group_by(HunterRank2018,Unit),
                            Hunters = mean(Hunters,na.rm = T))
HunterRank2018$HuntersRank = rank(-HunterRank2018$Hunters)
HunterRank2018 <- filter(HunterRank2018, HuntersRank <= 50) # top 50 units
In order for the chart to retain the order of the rows, the X-axis variable (i.e. the categories) has to be converted into a factor.
HunterRank2018 <- HunterRank2018[order(-HunterRank2018$Hunters), ] # sort
HunterRank2018$Unit <- factor(HunterRank2018$Unit, levels = HunterRank2018$Unit) # to retain the order in the plot
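A tiny standalone illustration of that factor trick (hypothetical units): without the factor conversion, ggplot would order the x-axis alphabetically rather than by hunter count.

```r
df <- data.frame(Unit = c("77", "12", "4"), Hunters = c(300, 200, 100))
# Lock in the current (sorted) row order as the factor level order
df$Unit <- factor(df$Unit, levels = df$Unit)
levels(df$Unit) # "77" "12" "4" -- ggplot will plot the categories in this order
```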
Lollipop Chart
ggplot(HunterRank2018, aes(x=Unit, y=Hunters)) +
  geom_point(size=3) +
  geom_segment(aes(x=Unit, xend=Unit, y=0, yend=Hunters)) +
  labs(title="Predicted Elk Hunters 2018, Top 50 Units",
       subtitle="Hunters by Unit",
       caption="source: cpw.state.co.us")
TODO - commentary
TODO